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Abstract 

In this paper, we compare the performance of two methods for esti- 
mating Bayesian networks from data containing exogenous variables 
and random effects. The first method is fully Bayesian in which a 
prior distribution is placed on the exogenous variables, whereas the 
second method, which we call the residual approach, accounts for the 
effects of exogenous variables by using the notion of restricted maxi- 
mum likelihood. We review the two score-based metrics, then study 
their performance by measuring the Kullback Leibler divergence, or 
distance, between the two resulting posterior density functions. The 
Kullback Leibler divergence provides a natural framework for com- 
paring distributions. The residual approach is considerably simpler 
to apply in practice and we demonstrate its utility both theoretically 
and via simulations. In particular, in applications where the exoge- 
nous variables are not of primary interest, we show that the potential 
loss of information about parameters and induced components of cor- 
relation, is generally small. 



Keywords: Bayesian network, Exogenous variables, Kullback Leibler diver- 
gence, Gene regulatory networks, Variance components 



Comparison of Bayesian network scores 



2 



1 Introduction 

Methods for the estimation of Bayesian networks, which encode conditional 
independence relationships of a set of variables, have, until recently, assumed 
data sets that consist of independent and identically distributed samples, as 
described in Chapter 16 of [9|. These methods may be split into two cat- 
egories, called constraint-based and score-based methods, CE51 HZ]. Re- 
cent work by Kasza et al |S|, has extended the applicability of score-based 
methods to data sets which do not necessarily consist of independent and 
identically distributed samples. These authors developed two score metrics 
which are extensions of the BGe metric of Geiger and Heckerman, [5], for 
use in conjunction with score-based methods, to account for complex sam- 
pling structures and additional components of variance. The first metric, 
called the Bayesian score metric, involves placing a prior distribution on the 
effects of exogenous variables. The second metric, inspired by the notion of 
restricted maximum likelihood, and called the residual score metric, is non- 
parametric in the effects of exogenous variables. These two score metrics lead 
to different posterior distributions for Bayesian network parameters, and a 
formal comparison of these posterior distributions is necessary to determine 
if the residual approach provides a useful alternative to the (fully) Bayesian 
approach. This comparison is the subject of the present paper. 

In Section [2J, score-based estimation of Bayesian networks is briefly re- 
viewed, as are the Bayesian and residual score metrics. The posterior distri- 
butions obtained using each score metric are also presented here. In Section 
[3] the posterior distributions are compared using the Kullback Leibler di- 
vergence, which in general provides a useful basis for comparing probability 
density functions. The comparison of the posterior densities based on the 
Kullback Leibler divergence provides justification for the use of the resid- 
ual score metric in the estimation of Bayesian networks, both theoretically, 
by simulations and the analysis of data on grape-berry heat-shock genes in 
Section HI 

2 Learning Bayesian networks and estimat- 
ing parameters 

Bayesian networks were first introduced by Pearl in [T2j. A Bayesian net- 
work B = (Q, 0), 9 = . . . , 9 p }, for a random vector X = . . . , X p ) T 
consists of two components: a directed acyclic graph associated with X, 
Q = (V, E), with V = {Xi, . . . , X p }, E C V X V, and a set of conditional dis- 
tributions {f(xi\x P .,9i)\i = 1, . . . ,p}. The set Pi consists of those variables 
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Xj such that there is a directed edge from j to i in Q: Pi — {Xj\(j, i) G E}. 
The joint distribution for X may then be written as 

p 

f(x\e) = Y[f(x i \x Pi ,e i ). 

i=i 

We make the assumption that X\Q ~ -/V(0, £). Bayesian networks are par- 
ticularly useful as they allow the estimation of covariance matrices for high- 
dimensional data sets, which contain fewer samples than random variables, 
since £ can be estimated from the Bayesian network. Additionally, the di- 
rected acyclic graph of a Bayesian network encodes information about the 
conditional dependence relationships between the variables in X. The di- 
rected Markov properties, as described in Lauritzen [11], for example, allow 
for more conditional independence relationships to be read directly from the 
graph Q than could be read from 

Estimation of a Bayesian network for X given a data set d requires es- 
timation of the parameters G and learning the structure of Q. To learn the 
structure, score-based methods move through the space of directed acyclic 
graphs, attempting to find the graph that maximises some score metric. An 
obvious choice of score metric is the likelihood of a graph, however, the struc- 
ture that maximises the likelihood is the complete directed acyclic graph, 
encoding no conditional independence relationships, [9]. Bayesian score met- 
rics such as those considered here avoid this problem of over-fitting. When 
Bayesian score metrics are used to learn structure, parameters may be esti- 
mated using Bayesian techniques. 

The Bayesian score of a directed acyclic graph Q for a random variable X 
is defined to be proportional to the posterior probability of the graph given 
the data set d. |6j: 

s(g\d) = p(G) P (d\g) = p{G) [ P (d\g, &) P (e\g)dG, (i) 

Jv{ P ) 

where p(G) is the prior probability of the graph g, p(d\g) is the marginal 
likelihood of the data given the structure, and V(p) is the space of symmetric 
positive-definite p x p matrices. We will not consider p(G) any further. 

Given the acyclicity of the graphs considered, p(d\g, 0) = n?=i f( x i\ x Pn 
where £Cj is the n- vector of samples of Xj , and x p i is the n x | P t | matrix of 
samples of the parents of X$ in the graph g. The usual assumption is that 
the n samples are independent and identically normally distributed: 

Xi\x Pi , 7^ ~ N(x Pt -f i ,il) i I). 
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To get the score metric in Equation (jTJ), prior distributions are required 
for 7^ and fa. As shown by Geiger and Heckerman, [5], in the case of iid 
samples, to obtain a score metric that scores graphs that encode equiva- 
lent sets of independence relationships identically, a property known as score 
equivalence, the choice of priors for ^ i and ipi is limited to priors of the form 

/§ _l_ \p.\ T \ 

TilV^i ~ ^|P 4 |(0, t~ ipj), ijji ~ Inverse Gamma ( — — -, -J. (2) 

Given these priors, the BGe score metric of [5J is obtained, which we de- 
note So(G\d) = p{Q) rii=i fo( x i\ x Pi)- The expression for jo is provided in 
Appendix A. 



Non-independent and identically distributed data 

Often the available data set will be more complex, with non-independent 
samples, or a complex mean structure including exogenous variables as ran- 
dom effects . Such additional complexities may be accounted for through the 
inclusion of m exogenous variables in the model, [7], [8]. If Q is the n x m 
matrix containing data on m exogenous variables, we assume 

Xilxp.^^ip^bi ~ N(x P .'y i + Qb i} ip i I), 

where the elements in bi are called the effects of the exogenous variables. 

In addition to priors for f y i and ipi, a prior is required for bi, the effect of 
the exogenous variables. Kasza et al j8] note that bi may be dealt with in 
two ways, leading to two different score metrics. To satisfy score equivalence, 
these approaches both use the priors in Equation (P2J) for j i and fa. 

The first approach, called the Bayesian approach, is to place a prior dis- 
tribution on 6j. An extension of a result in [6] implies that in order for score 
equivalence to hold, if var(bi) = E^^ b^E^ must be normally distributed. 
We consider prior distributions for bi that are of the form bi\ipi ~ A' r m (0, ipiV), 
since these are the only priors that result in a score metric with a closed form. 
The Bayesian score metric is given by Sb((?|g£) = p(Q) nf=i fv( x i\ x Pi), where 
fv is given in Appendix A. 

The second approach, called the residual approach, is non-parametric in 
bi. This removes the effects of exogenous variables by using linear combi- 
nations of residuals obtained after regressing Xi on the columns of Q. This 
is achieved by pre-multiplying each Xi by P T , where P is an n x (n — m) 
matrix such that P T Q = 0, P T P = I, PP T = I -Q (Q T Q) 1 Q T ■ It can 
then be shown that P T Xi ~ A" n _ m [P T xp i j i , r ip i l) . The residual approach 
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is related to restricted maximum likelihood estimation, and is particularly 
advantageous when the effects of the exogenous variables are included to 
improve the estimation of a Bayesian network for X, but are not of in- 
trinsic interest in themselves. Additionally, when the prior covariance ma- 
trix of bi cannot be accurately specified, or when the assumption of a nor- 
mal prior distribution for the 6, is not warranted, the residual approach is 
preferable to the Bayesian approach. The residual score metric is given by 
Sr(G\(1) = p(Q) nLi fR(xi\x Pj ), and f R is shown in Appendix A. 

Having used either the Bayesian or residual score metric for learning 
the graphical structure, parameter estimates may be obtained from posterior 
distributions. Since posterior estimates of bi are unavailable from the residual 
approach, we only consider the posterior distributions of 7 i and fy. 

Using the priors from the Bayesian approach, the likelihood given in Equa- 
tion (j2]) and Bayes' theorem, the following posteriors are obtained: 

j i \ip i ,x i ,x Pi ~ 

Mb = 

ip i \x i ,xp i ~ 

Pb = 

The joint posterior density obtained under the full Bayesian approach is 
denoted by f B {li,^i\ 

Similarly, using the residual approach, the posteriors can be shown to be 

y i \ipi,x i ,x Pi ~ 

= 

ip i \x i ,x Pi ~ 

0R = 

The joint posterior density obtained under the residual approach is denoted 
by fR(nfi^i\xi,x Pi ). 

The residual approach does not require the specification of any hyperpa- 
rameters relating to 6,, making it easier to use than the Bayesian approach. 
Given that in the Bayesian approach, the variance of 6j is dependent upon 
tpi, and in turn related to the variance of "y i , we may obtain less information 
about these parameters when the residual approach is used instead of the 



N\Pi\ (/*b» V>< ( r/ + xT Pl H vx Pl ) 1 

(tI + X%.H V X Pi ) 1 X T p% EyXi, 

Inverse GannnaP + " + |P ' 1 , ft 



T - + -xj H v x l - -xjH v x Pi (rJ + x T Pi H v x Pi ) 1 x T P .H v Xi. 



N\ PA [ti^ {rI + x T P PP T x Pi y l 
(tI + xl i PP T x Pi )' 1 xl z PP T x l , 
Inverse Gamma ^— — — — — , (3p 

1 + ± X JPP T XI - l -x T i PP T x Pi ( T J + x T Pi PP T x Pi )- 1 x T Pi PP 
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Bayesian approach. It is important to quantify the difference between the 
Bayesian and residual approaches in this respect, and this is done in the next 
section by measuring the Kullback-Leibler distance between /s(7j, i^i\xi,x Pi ) 
and fiidi^ilx^XPi). 

3 Comparison of approaches 

Using the Kullback-Leibler divergence as a measure of the distance between 
the density functions, we show that the distance between the posterior den- 
sities for the Bayesian network parameters 'y i and ipi obtained under the 
Bayesian and residual approaches is generally small, and decreases as the 
sample size increases. In this way, theoretical justification for the residual 
approach is provided. 

The Kullback-Leibler divergence, [TU], between fsili, i/>i\xi,XPi) and fn(*Yi, ^i\Xi,x Pi ) 
is given by 

D(f B ,fn)= [ log ( A \ Xl ' XPl \ ) M 7 », 1>i\xi, xp^drtMi- 

./(O.oo) xRl^l IjRhi^ilXhXp) J 

The exact formula is set out in Appendix B. Instead of just considering the 
divergence associated with j { and ipi associated with a given X±, the diver- 
gence associated with S, the covariance matrix of X after marginalising over 
bi, may be obtained. The divergence between fp(Ti\X), the posterior den- 
sity of £ obtained under the Bayesian approach, and fn(H\X), the posterior 
obtained under the residual approach, is then available. 

Lemma 3.1. If the underlying graphical structure of X is known, the diver- 
gence between /b(E|X) and fp(E\X) is given by 

p 

D s {f B (Z\X), f R (Z\X)} = J2 D {/b(7o &\xi, x Pi ), /*( 7i , MX*, x Pi )}. 

t=i 

If the underlying graphical structure of X is not known, bounds for the di- 
vergence are given by the divergence for the covariance matrix corresponding 
to a graph with no edges: 

p 

D Z = D {fB{li,A\Xi), f R {~1i,ll)i\Xi)} , 
i=l 

and the divergence for the covariance matrix of an arbitrary full graph: 
p 

D i = ^2 D {/s(7i, A\xi, xx, . . . , Xi-i), f R (ji, ^i\x i: x u . . . , x^)} . 
i=i 
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Proof. This result follows directly from the properties of the Kullback Leibler 
divergence. □ 

Our main result is the following Theorem which justifies the use of the 
residual approach instead of the Bayesian approach: 

Theorem 3.1. As n oo, L> s {f B (Z\X), f R (E\X)} -»■ 0. 

Proof. See Appendix C. □ 

This Theorem tells us that as sample size increases, the posterior densi- 
ties obtained when using the residual metric more closely approximate those 
obtained using the fully Bayesian approach. Hence, provided the sample size 
is large enough, the residual approach offers a useful alternative to the fully 
Bayesian approach. 



4 Examples 

In this section, the residual and Bayesian approaches are compared using 
the Kullback-Leibler divergence for some specific data sets. We first consider 
simulated data sets and then consider a data set consisting of expression 
levels of grape heat-shock genes. 



4.1 Example 1 

In this example, multiple data sets were simulated from the following system 
of linear recursive equations: 

i=i 

bi=(b lU b i2 ) T ~ iV(0,^n 
7< = (7i,i > • • ■,li-i,i) T ~ N(0,7pil), ipi ~ Inv. Gamma(l, 2) 
i = 1,...,20, j = 1,2, fe = l,...,n, 

where the only non-zero 7^s were those corresponding to the edges in the 
graph of Figure [Q, and V = v~ l L One hundred data sets were simulated 
according to this model for each pair (n,v), where n = 5, 10, 20, 50, 100 and 
v = 0.001, 0.01, 0.1, 1, 10, 100. For each of the simulated data sets, d£, Df., 
and the divergence corresponding to the true structure were calculated. The 
key results are summarised in Figure El 
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Figure 1: Connected components of the underlying graph of Example 1. 



As the true graph is quite sparse, the true divergence is closer to that 
of the empty graph than that of the full graph. For all values of v, as the 
sample size increases, the divergence decreases, and for all sample sizes, as v 
increases, the divergence increases. When v is large, the samples are "similar" 
to independent and identically distributed samples, and the fully Bayesian 
approach allows for this, whilst the residual approach cannot. In these situa- 
tions, the exogenous variables are over-corrected for when the residual metric 
is applied. For larger sample sizes, Figure [2] shows that the divergences ob- 
tained for the empty and full graphs provide reasonable approximations to 
the divergence associated with the true structure. 

These observations are useful in providing guidelines for the use of the 
residual approach for a given data set. If v is small, no matter what size 
the ratio n/p is, the posterior distributions obtained under the Bayesian and 
residual approaches will be close to each other. In the case where n/p is 
small, provided v is large, a similar conclusion is reached. However, for data 
sets with small values of n/p, if the effect of exogenous variables are a priori 
thought to have small variances, the residual approach should be used with 
caution. 

4.2 Example 2 

When the Bayesian approach is used, not much information is available to 
guide prior specification of the covariance matrix of the effects, so iid ran- 
dom effects are usually assumed. In this example, we show that there exist 
situations where the residual posterior density is closer to the posterior ob- 
tained using the data-generating prior, than the posterior density obtained 
by assuming iid random effects. 
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Figure 2: The results of Example 1. The open circles represent the median 
value of .Df;, the filled circles the median value of Dj,, and the triangles the 
median loss associated with the true structure, of the 100 simulated data sets 
for each (n, v) pair. The vertical bars represent interquartile ranges. Note 
that the vertical scales of the plots differ. 



Data sets are simulated from the following system of linear recursive equa- 
tions: 

i-1 3 

X i:j = ^ likX k j + ^2 qr i hir + e *i' e v ~ N (°> 

fe=l r=l 

6i ~ N 3 (0, ipiV), Ji = ( 7 i,fe, • • • , 7^i,fc) T ~ ^(0, ^J), & ~ Inv. Gamma(l, 2) 
z = 1,...,10, j = l,...,100, 

where the only non-zero 7jfcS were those corresponding to the edges in the 
graph of Figure [TJ and the q r j are constant across data sets, having been 
simulated from a standard normal distribution. One hundred data sets were 
simulated according to this model for each of the following selections for V: 




V = I 3 ,V 1 = 1 \ ,V 2 = 1 i ,V, 




10 

j_ 

. 10 




Let /„ denote the posterior distribution obtained under the Bayesian ap- 
proach when the prior covariance matrix of the effects of exogenous vari- 
ables is v^ 1 !. For each data set, D^(fB,fv) — -Ds(/b,/r) is calculated 
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Figure 3: Medians (thick lines) and upper and lower quartiles (thin lines) 
of -De(/b, f v ) — /r) for a range of values of v, for the 100 data sets 

generated for each of the scenarios described in Example 2. Blue lines corre- 
spond to dependent effects of exogenous variables, black lines to independent 
effects. Solid lines correspond to heteroscedastic effects of exogenous vari- 
ables, dashed lines to homoscedastic effects. 



for values of v between 0.0001 and 10, given the true covariance matrix 
of 6j. Figure [3] summarises the median value, and the upper and lower 
quartiles of D^(fg,f v ) — D^(fg,fn) for the 100 data sets simulated un- 
der each of the four scenarios. The solid lines in Figure [3] correspond to 
the scenarios where the effects of exogenous variables are heteroscedastic, 
and the black lines correspond to the scenarios with independent effects. 
When Ds(f B , f v ) — Ds(/b> fn) is positive, insufficient variation in the data 
is accounted for by assuming iid effects of exogenous variables with variance 
v^ 1 tpi. As can be seen in Figure |31 this happens for all scenarios with in- 
creasing probability as v increases. Similarly, when Ds(/b, fv) — -Ds(/s, fa) 
is negative, the residual approach removes too much of the variation in the 
data. Given the amount of prior information typically available about the 
covariance structure of the effects of exogenous variables, this example shows 
that use of the residual approach will often be preferable to assuming inde- 
pendent and identically distributed effects. 



Comparison of Bayesian network scores 



11 



4.3 Grape-berry heat-shock gene example 

We now consider a data set consisting of samples of the expression levels of 
grape genes, previously discussed in [8]. This data set consists of n = 50 
expression levels of each of p = 26 grape genes, where the grapes themselves 
were sampled from three different vineyards located in different wine grow- 
ing regions of South Australia, Australia. These 26 genes are heat-shock 
genes, see [IB], the expression levels of which are known to be associated 
with changes in ambient temperature. Accordingly, air temperature at each 
vineyard was recorded every hour from 5.5 hours to 0.5 hours before the 
grapes were sampled. 

The data set considered here is a subset of a larger data set obtained 
from an Affymetrix chip microarray experiment conducted over the course 
of three years. Gene expression values were obtained from 174 grape berry 
tissue samples: 68 of these tissue samples were taken from one vineyard, 68 
from the second vineyard, and 38 from the third. At the first two vineyards, 
four grape-berry tissue samples were selected each week for 17 weeks, while at 
the third, 2 grape-berry tissue samples were selected each week for 19 weeks. 
At each of the vineyards, the first samples were taken at fruit set, when the 
fertilised grape flowers began to form berries. Samples were then taken each 
week for a pre-specified number of weeks. In this way, gene expression levels 
were measured over the course of the development of the grape berries. Of the 
174 samples taken, 162 had complete temperature records. The data analysed 
consist of the samples from each vineyard taken in the third to seventh weeks 
of sampling, inclusive. The samples from these weeks correspond to a period 
after fruit set, but before veraison, and it is thought that the relationships 
between expression levels of genes are relatively stable during this period of 
berry development, [3"| IT3"]. 

Let Xij be sample j of gene i, i — 1, . . . , 26, j — 1, . . . , 50, and let q r j be 
the data associated with sample j of exogenous variable r, where m exogenous 
variables are included in the model. Then the following model is assumed 
for each sample of each gene: 



ni 



l£Pi r=l 




(3) 



For the grape-berry genes under study here, temperature, which has been 
observed directly at the different vineyards, is a known driver of biological 
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activity. Moreover, when two or more genes respond similarly to the same 
driver of biological activity, the effect is to produce a component of corre- 
lation between the corresponding gene expression levels. Thus we should 
study the effects of temperature as an exogenous variable. There are also 
likely to be additional variables which do not correspond directly to a single 
biological factor such as temperature. For example, the three vineyards are 
likely to differ in a number of features such as soil type and fertility, moisture 
and other micro-climate conditions, each of which could potentially influence 
the expression levels of certain sets of genes. Here the three vineyards are 
separated by large regional distances, but share the same macro-climate in 
southern Australia. Thus, vineyards should be modelled as an additional 
exogenous variable with potentially considerable heterogeneity. 

It is unlikely that the effects of temperature and vineyard are independent 
and identically distributed. While such a claim may be valid for either the 
temperature effects or the vineyard effects alone, it is highly unlikely that 
the effects of temperature and vineyard are identically distributed. There 
may also be some dependence between the temperature and vineyard effects. 
Given the difficulty in specifying a joint prior variance matrix of the tem- 
perature and vineyard effects, consideration of models including both tem- 
perature and vineyard effects simultaneously are unlikely shed light on the 
performance of the residual approach to the estimation of Bayesian networks. 
We therefore proceed by considering simple models, fitting temperature and 
vineyards as separate exogenous variables. 

Firstly, we consider the vineyards only model, where m = 3 and in which 
we are interested in the temperature-induced correlations between genes, and 
secondly, the temperature only model, where m = 6, where we do not re- 
move the components of correlations induced by the vineyard micro-climates. 
Note that we are ignoring any temperature trend-component in all our mod- 
els. Although models containing both temperature and vineyard effects may 
potentially be of interest, the effects may be confounded as explained above, 
and there is a risk of over- fitting the data. In fact, for the full interaction 
model fitted to the grape-berry gene data, the effective sample size, n — m, 
would be zero, and the Kullback Leibler divergence could then be substan- 
tially artificially inflated. 

Since neither the true network nor the true value of v is known for this 
data set, the bounds _D| and D(, are calculated for a range of values of 
v. The results are shown in Figure HI The left-hand graph in the figure 
displays the loss of information when the three vineyard effects only are 
included in the analysis and the right-hand graph displays the divergence 
when only the six main temperature effects are included. Figure H] indicates 
that for either model and all considered values of v, if the true underlying 
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Figure 4: Upper and lower bounds of the divergence for the marginal covari- 
ance matrix of the 26 grape genes, when vineyards and then temperatures 
are included as exogenous variables in the analysis. In both graphs, the solid 
line is the divergence corresponding to the empty graph, and the dashed line 
is the divergence corresponding to the full graph. 



graph is thought to be sparse, as many biological networks are thought to 
be, the loss of information about the marginal covariance matrix when the 
residual approach is used will be minimal. If the true graph of the expression 
levels of the genes is thought to be dense, for larger values of v, the figure 
shows that the divergence for the temperature model will be less than that 
associated with the vineyard model. The temperature model is naturally 
likely to be more explanatory, with the higher number of exogenous variables 
fitted. For either model, the Kullback-Leibler divergence is small and the 
residual approach metric is of demonstrable practical utility. 

5 Conclusion 

Using the Kullback-Leibler divergence, we have compared two methods for 
estimating Bayesian networks for data containing exogenous variables and 
random effects. Provided that the sample size is not too small in a statis- 
tical sense, we can conclude that the residual score metric offers a useful 
alternative to a fully Bayesian approach, with the posterior density functions 
of key parameters obtained under the two approaches being generally close. 
Many contemporary bioinformatics studies are conducted using substantial 
sample sizes, often based on many hundreds of samples or patients. Even 
with smaller studies however, the results of our simulations and data analysis 
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provide confidence that the residual estimation approach will perform well 
with small samples in the presence of exogenous variables. 
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Appendix A 

foi^XilxpJ is given by the pdf of 

fvixilxpj is given by the pdf of 

*<WI (°> s+\P-\ { Hy ~ HyXPi ( r/ + xT ^ H v x PiY l x T p t H v } ^ , 

H v = I - Q (V- 1 + Q T QY l Q T . 
f R (P T Xi\P T x P% ) is given by the pdf of 

Appendix B 

The Kullback Leibler divergence between /s(7j, ipi\xi, Xp^ and fpi^i, ipi\xi, 
is given by 
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DUbJr) 



- Vb) T {rl + x T P .PP T x P% ) {n R - fi B ) 



IP 
2 



1 S + n+ IPI 



+ 



+ 



(3b 4 
5 + n — m + I Pj I 



log 



B 



+ log 



S + n+lPA f(3 



— 1 ) + — Digamma 

Pb 




Appendix C 

Here we prove Theorem 13.11 By Lemma 13. 1\ we need only consider the di- 
vergence for the parameters for one regression: D {/b(7i, ^i\xi, x Pj ), /ii(Ti, il>i\xi, x Pi)}- 

Assume that each ccj is centred and scaled, so that xjxi = n — 1, and 
note that xjHyXi < n — 1 and xfPP T Xi < n — 1. 

First, consider the log determinant term in 



irTT^I PP T £C P .| 



log 



^ log { | (rl + x T Pi H v x P y (rl + x T p PP T x p ) | } 

log | (rJ + x T Pi H v x p y l (rl + x T P PP T x Pi ) 

I-\l-(rI + xl.HvXp^ 1 (rl + x T p PP T x p ) 



1 

— tr 
2 
1 

--tr (log 



1 

— tr 
2 



using the Taylor series expansion this can be written as 

t{I ~ (rl + a^fTyXjO" (rl + ^ PP^p) } 

fe=i 

If second- and higher-order terms are ignored, this becomes 

U r {/ _ [rl + x T P fl v x p y X (rl + x T p PP T x p ) } 

= ^ - \tr { (rl + x T Pi H v x Pi y l (rl + x T p PP T x p ) } ; 

terms which cancel with other terms in D. 
Note also that 



5 + n — m + IP,- 



, 'h 



(5 + n — m + I P» I 



El ( P b ~ fin 



k=l 



0. 



B 
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and since < 1, second- and higher-order terms in Ylh=i \ { ^ B ^b ) 

may be ignored, cancelling with other terms in D(Jb, Jr) so that the diver- 
gence becomes 

DUbJr) = Tr * 5 + U t ^ (Vr ~ VbY { rI + xl.PP T x Pi ) (n R - ix B ) 
Pb 4 

r / S+n-m+\Pj\ \ ~\ 

j V 2 / m -^- f S + n+lPA 

+ log < — ^-p r-?- > H Digamma 

^ r f s+n±\Pjl \ ( 2 



2 

Let n* = <5+ "+l p 'l ; anc i no t e that as n approaches infinity, so too does n t 
From [16] . as — > oo, 

r (".-?) _(„,)-? J 1 + ^±^ +0 fi 



Hence, for large n*, 

r(n,-f)l m . . , f m(m + 2) /l 



From [T] , for large values of n* 
Digamma (n.) = log (n.) - J- - + ^ - ^ + O (1) . (5) 

Hence, as n — > oo, log | p^+n+|pj^ j + y Digamma ^ (5+n +l p 'l j approaches 
zero. All that remains to consider is the quadratic term: 

1 S ~\~ TL ~\~ I P^ I. \ T / T T \ / \ 

- Oil - Mb) (tJ + Xp.PP aJ P J - /x B ) . 



/3b 4 
Using the following approximations, 

1 



(t + x^H v x Pi ) 1 « diag 



(x + ^PP^)- 1 « diag( r + ^_^ (QTg) _ lgT;c J 



where k <G p, we may write 



ft* « - r + xf x z - xjQ {V- 1 + Q T QY Q T Xi - £ -i— rmy . unTnr i nr J 

2 ^ p T + x k x k -x k Q{y v + Q L Q) Q 1 Xk 
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and 



(Vr - Vb) T (tI + x T Pi PP T x Pi ) (fj, R - n B ) 

f (xjPP T x k ) 2 1 _ 2 sr^ f xjH v x k x i PP T x k 



; k r + x\x k - x[Q (Q T Qy 1 Q T x k J r + x\x k 

J (xj H v x k f (r + xfc* _ x t q (gT g) -i gr^ I 
+ { (r + a£a: fc - x T k Q (V~^ + Q T QY 1 Q T x k y j ' (6) 

As n increases, S+n ^ P ^ approaches 1, each of the terms in Equation ([6]) 
approaches zero, proving the result. 



